Abstract

Hierarchical parametric models consisting of observable and latent variables
are widely used for unsupervised learning tasks, such as clustering. These
models can be regular or singular. In the regular case, there is a one-to-one
relation between a probability function and the parameter, the Fisher
information matrix is positive definite, and the estimation accuracy of both
observable and latent variables has been studied. In the singular case, there
is no such one-to-one relation, and conventional statistical analysis based on
the inverse Fisher matrix is not applicable; an algebraic geometrical analysis
has therefore been developed and used to elucidate the Bayes estimation of
observable variables. The present paper applies this analysis to latent-variable
estimation and determines its theoretical performance. Our results show that
the posterior distribution of the parameter in the observable-variable
estimation can be different from the one in the latent-variable estimation,
which implies that the Markov chain Monte Carlo method based on the parameter
and the latent variable cannot construct the proper posterior distribution.
Keywords: unsupervised learning, hierarchical parametric models, Bayes 
statistics, algebraic geometry, singularities 

1 Introduction 

Hierarchical parametric models are employed for unsupervised learning in many
data-mining and machine-learning applications. Statistical analysis of the models
is important both for revealing their theoretical properties and for practical
applications. For example, the asymptotic forms of the generalization error
and the marginal likelihood are used for model selection in the maximum-likelihood
and Bayes methods, respectively [1], [11], [8].

Table 1: Estimation classification according to the target variable and the model
case

Estimation Target \ Model Case | Regular Case      | Singular Case
-------------------------------+-------------------+-------------------
Observable Variable            | Reg-OV estimation | Sing-OV estimation
Latent Variable                | Reg-LV estimation | Sing-LV estimation

Parametric models generally fall into two cases: regular and singular. The
present paper focuses on models whose probability functions are continuous and
sufficiently smooth with respect to the parameter. In regular cases, the Fisher
information matrix is positive definite, and there is a one-to-one relation between
the parameter and the expression of the model as a probability function. Otherwise,
the model is singular, and the parameter space includes singularities. Due to these
singularities, the Fisher information matrix is not positive definite, and so the
conventional analysis methods that rely on its inverse matrix are not applicable.
In this case, an algebraic geometrical approach can be used to analyze the Bayes
method [12], [13].

Hierarchical models have both observable and latent variables. The latent
variables represent the underlying structure of the model, while the observable
ones correspond to the given data. For example, unobservable labels in clustering
are expressed as the latent variables in mixture models, and the system dynamics
of time-series data are expressed as a sequence of latent variables in hidden
Markov models. Hierarchical models thus have two estimation targets: observable
and latent variables. The well-known generalization error measures the performance
of the prediction of a future observable variable. Combining the two model cases
and the two estimation targets, there are four estimation cases, which are
summarized in Table 1. We will use the abbreviations shown in the table to specify
the target variable and the model case; for example, Reg-OV estimation stands for
estimation of the observable variable in the regular case.

In the present paper, we will focus on the Sing-LV estimation. One of the main
concerns in unsupervised learning is the estimation of unobservable parts, and in
practical situations, the ranges of the latent variables are unknown, which
corresponds to the singular case. The other estimation cases have already been
studied; the accuracy of the Reg-OV estimation has been clarified on the basis of
the conventional analysis method, and the results have been used for model
selection criteria, such as AIC [1]. The primary purpose of the algebraic
geometrical method is to analyze the Sing-OV estimation, and the asymptotic
generalization error of
the Bayes method has been derived for many models [3], [2j [TU1 HSl [T7J [18j [191 120] . 
Recently, an error function for the latent-variable estimation was formalized in a
distribution-based manner, and its asymptotic form was determined for the Reg-LV
estimation of both the maximum likelihood and Bayes methods [H]. Hereinafter, 
the estimation method will be assumed to be the Bayes method unless it is explicitly 
stated otherwise. 

The main contributions of the present paper are summarized as follows: 

1. The algebraic geometrical method for the Sing-OV estimation is applicable to 
the analysis of the Sing-LV estimation. 

2. The asymptotic form of the error function is obtained, and its dominant order 
is larger than that of the Reg-LV estimation. 

3. We found that the posterior parameter distribution of the Sing-LV estimation 
is different from that of the Sing-OV estimation. 

The third result is important for practical applications: in some cases, methods
based on latent variables, such as the Markov chain Monte Carlo (MCMC) method
with Gibbs sampling, cannot construct the proper posterior distribution of the
parameter because the joint distribution of the parameter and the latent variable
can have a different peak from that of the marginal distribution. Asymptotic
analysis shows the condition under which these methods fail.

The rest of this paper is organized as follows. The next section summarizes the
observable-variable estimation, which includes the Reg-OV and the Sing-OV
estimations, and introduces the asymptotic prediction performance based on
algebraic geometrical analysis. In Section 3, the latent-variable estimation and
its evaluation function are formulated in a distribution-based manner. Sections 4
and 5 show the main results: the former section presents the derivation of the
asymptotic form of the error function in the Sing-LV estimation, and the latter
presents an investigation of the asymptotic accuracy of mixture models and a
demonstration of how to calculate the form in a concrete model. Finally,
Sections 6 and 7 present the discussion and conclusions, respectively.



2 Accuracy of the Observable-Variable Estimation and the Algebraic Geometrical Analysis

In this section, we formalize the Bayes method for the observable-variable estimation
and introduce the algebraic geometrical method, which is applicable to both the
Reg-OV and the Sing-OV estimations.
Let a learning model be defined by

\[ p(x|w) = \sum_{y=1}^{K} p(x,y|w) = \sum_{y=1}^{K} p(y|w)\,p(x|y,w), \]

where $x \in R^M$ is an observable variable, $y \in \{1,\dots,K\}$ is a latent one,
and $w$ is a parameter in a continuous space. For discrete $x$ such that
$x \in \{1,2,\dots,M\}$, all results hold by replacing $\int dx$ with
$\sum_{x=1}^{M}$. Assume that the observable data $X^n = \{x_1,\dots,x_n\}$ are
independent and identically distributed from the true model, which is expressed as

\[ q(x) = \sum_{y=1}^{K^*} q(y)\,q(x|y). \]

Note that the value range of the latent variable $y$, described as
$\{1,\dots,K^*\}$, is generally unknown and can be different from the one in the
learning model. We also assume that the true model satisfies the minimality
condition: $K^*$ is the minimum range of $y$ for function representation of $q(x)$.
For example, consider a three-component model such that
$q(x|y=1) \neq q(x|y=2) = q(x|y=3)$ for all $x \in R^M$. The minimality
condition requires the two-component expression

\[ q(x) = q(y=1)\,q(x|y=1) + \{q(y=2) + q(y=3)\}\,q(x|y=2)
        = q(y=1)\,q(x|y=1) + q(y=\tilde{2})\,q(x|y=\tilde{2}), \]

where $\tilde{2} = \{2,3\}$.

The present paper focuses on the case in which the true model is in the class of
the learning model. More formally, there is a set of parameters expressing the true
model such that

\[ W_X^t = \{w^* ;\ p(x|w^*) = q(x)\} \neq \emptyset, \]

which is referred to as the true parameter set for $x$. This means that the latent
variable range satisfies $K = K^*$ or $K > K^*$. The former relation corresponds to
the regular case and the latter one to the singular case. The true parameter set
$W_X^t$ includes $K!$ isolated points in the regular case due to the symmetry of the
parameter space. On the other hand, it consists of an analytic set in the singular
case. We explain this structure using the following model settings.

Example 1 Let a learning model be a two-component mixture model, and let the
true model be a single component. The mathematical expressions are given by

\[ p(x|w) = a f(x|b_1) + (1-a) f(x|b_2), \]
\[ q(x) = f(x|b^*), \]

where $x \in R^1$, $f(x|b)$ is a component distribution, and $w = \{a, b_1, b_2\}$
such that $a \in [0,1]$ and $b_1, b_2 \in R^1$.
The latent variable $y$ describes the labels of the components. The model expression
with the labels is defined by

\[ p(x|w) = \sum_{k=1}^{2} p(y=k|w)\,p(x|y=k,w) = \sum_{k=1}^{2} p(x,y=k|w). \]

We can confirm that the true parameter set consists of the following analytic set:

\[ W_X^t = \{a = 1, b_1 = b^*\} \cup \{a = 0, b_2 = b^*\} \cup \{b_1 = b_2 = b^*\}. \]

The Fisher information matrix is not positive definite in this set. Moreover, the
intersections of these subsets are singularities. If the true model has two
components,

\[ q(x) = a^* f(x|b_1^*) + (1-a^*) f(x|b_2^*) \]

for $a^* \neq 0, 1$ and $b_1^* \neq b_2^*$, the estimation will be in the regular
case. Due to $K! = 2! = 2$, the set consists of two isolated points:

\[ W_X^t = \{(a = a^*, b_1 = b_1^*, b_2 = b_2^*),\ (a = 1-a^*, b_1 = b_2^*, b_2 = b_1^*)\}. \]
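
The structure of the true parameter set in Example 1 can be checked numerically. The following is a minimal sketch, assuming a binomial component on $x \in \{0,\dots,M\}$ as the component distribution $f$ (Example 1 itself leaves $f$ unspecified); the function names and the numerical values of $M$ and $b^*$ are hypothetical. It evaluates one point from each of the three branches of the set and confirms that each reproduces $q(x)$, while a generic parameter does not.

```python
import math

def binom_pmf(x, M, b):
    """Binomial component f(x|b); an assumed choice of component distribution."""
    return math.comb(M, x) * b**x * (1 - b) ** (M - x)

def mixture(x, M, a, b1, b2):
    """Two-component learning model p(x|w) with w = (a, b1, b2)."""
    return a * binom_pmf(x, M, b1) + (1 - a) * binom_pmf(x, M, b2)

def max_gap(M, w, b_star):
    """Largest pointwise difference |p(x|w) - q(x)| over the support."""
    return max(abs(mixture(x, M, *w) - binom_pmf(x, M, b_star))
               for x in range(M + 1))

M, b_star = 5, 0.3  # hypothetical values

# One representative point from each branch of the true parameter set.
branches = {
    "a=1, b1=b*": (1.0, b_star, 0.9),    # b2 is free
    "a=0, b2=b*": (0.0, 0.1, b_star),    # b1 is free
    "b1=b2=b*": (0.42, b_star, b_star),  # a is free
}
gaps = {name: max_gap(M, w, b_star) for name, w in branches.items()}

# A generic parameter outside the set does not reproduce q(x).
generic_gap = max_gap(M, (0.5, 0.2, 0.6), b_star)

for name, gap in gaps.items():
    print(f"{name}: max |p - q| = {gap:.2e}")
print(f"generic point: max |p - q| = {generic_gap:.2e}")
```

In each branch one coordinate of $w$ is free, which is why the set is a union of curves rather than isolated points.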

In Bayesian statistics, estimation of the observable variables is defined by

\[ p(x|X^n) = \int p(x|w)\,p(w|X^n)\,dw, \]
\[ p(w|X^n) = \frac{1}{Z(X^n)} \prod_{i=1}^{n} p(x_i|w)\,\varphi(w;\eta), \]

where $\varphi(w;\eta)$ is a prior distribution with the hyperparameter $\eta$,
$p(w|X^n)$ is the posterior distribution of the parameter, and its normalizing
factor is given by

\[ Z(X^n) = \int \prod_{i=1}^{n} p(x_i|w)\,\varphi(w;\eta)\,dw. \]

This formulation is available for both the Reg-OV and Sing-OV estimations. The
estimation accuracy is measured by the average Kullback-Leibler divergence:

\[ G(n) = E_X\left[ \int q(x) \ln \frac{q(x)}{p(x|X^n)}\,dx \right], \]

where the expectation is

\[ E_X[f(X^n)] = \int f(X^n)\,q(X^n)\,dX^n. \]

Let us define the free energy as

\[ F(X^n) = -\ln Z(X^n), \]

which plays an important role in Bayes statistics as a criterion for selecting the
optimal model. In the Reg-OV estimation, the Bayesian information criterion (BIC)
[11] and the minimum-description-length principle (MDL) [8] are both based on the
asymptotic form of $F(X^n)$. Theoretical studies often analyze the average free
energy given by

\[ F_X(n) = -nS_X + E_X[F(X^n)], \]

where the entropy function is defined by

\[ S_X = -\int q(x) \ln q(x)\,dx. \]

The model that minimizes $F(X^n)$ is then selected as optimal from among the
candidate models. The energy function $F_X(n)$ allows us to investigate the average
behavior of the selection. Note that the entropy term does not affect the selection
result because it is independent of the candidate models. According to the
definitions, the average free energy and the generalization error have the relation

\[ G(n) = F_X(n+1) - F_X(n), \]

which implies that the asymptotic form of $F_X(n)$ also relates to that of $G(n)$.

The algebraic geometrical analysis [12], [13] is applicable to both the regular and
singular cases for deriving the asymptotic form of $F_X(n)$. The rest of the section
discusses the case $W_X^t \neq \emptyset$, although it is also important to consider
the case $W_X^t = \emptyset$, where the learning model cannot attain the true model.

Let us define another Kullback-Leibler divergence,

\[ H_X(w) = \int q(x) \ln \frac{q(x)}{p(x|w)}\,dx, \]

which is assumed to be analytic. We consider the prior distribution
$\varphi(w;\eta) = \psi_1(w;\eta)\psi_2(w;\eta)$, where $\psi_1(w;\eta)$ is a
positive function of class $C^\infty$ and $\psi_2(w;\eta)$ is a nonnegative analytic
function. Then, the zeta function of a statistical model is given by

\[ \zeta_X(z) = \int H_X(w)^z\,\varphi(w;\eta)\,dw, \]

where $z$ is a complex variable. From algebraic analysis, we know that its poles are
real, negative, and rational [3]. Let the largest pole and its order be
$z = -\lambda_X$ and $m_X$, respectively. The zeta function includes the term

\[ \frac{f_c(z)}{(z + \lambda_X)^{m_X}}, \]

where $f_c(z)$ is a holomorphic function. We define the state density function of
$t > 0$ as

\[ v(t) = \int \delta(t - H_X(w))\,\varphi(w;\eta)\,dw. \]

The zeta function is its Mellin transform:

\[ \zeta_X(z) = \int_0^\infty v(t)\,t^z\,dt. \]

Moreover, it is known that the inverse Laplace transform of $v(t)$ has the same
asymptotic form as $F_X(n)$. Following the transforms from $\zeta_X(z)$ to $F_X(n)$
through $v(t)$, we obtain the asymptotic form

\[ F_X(n) = \lambda_X \ln n - (m_X - 1) \ln\ln n + O(1). \]

Using the same coefficients, the asymptotic form of the generalization error is given
by

\[ G(n) = \frac{\lambda_X}{n} - \frac{m_X - 1}{n \ln n} + o\left(\frac{1}{n \ln n}\right). \tag{1} \]

Since the learning model can attain the true model, we can confirm that the
generalization error converges to zero for $n \to \infty$. The coefficients are
$\lambda_X = d/2$ and $m_X = 1$ in the regular case, where $d$ is the dimension of
the parameter, which shows that the asymptotic form holds in both
the Reg-OV and Sing-OV estimations.
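
The relation between the free energy and the generalization error can be illustrated numerically. The following is a minimal sketch that keeps only the leading terms of the two asymptotic forms (the $O(1)$ and $o(\cdot)$ remainders are dropped); the function names and the values of the pole and its order are hypothetical. It compares the finite difference $F_X(n+1) - F_X(n)$ with the leading terms of Eq. (1).

```python
import math

def free_energy(n, lam, m):
    """Leading terms of F_X(n): lam*ln(n) - (m-1)*ln(ln(n)); O(1) term dropped."""
    return lam * math.log(n) - (m - 1) * math.log(math.log(n))

def generalization_error(n, lam, m):
    """Leading terms of Eq. (1): lam/n - (m-1)/(n*ln(n))."""
    return lam / n - (m - 1) / (n * math.log(n))

n = 10_000
lam, m = 0.75, 2  # hypothetical pole position and order

diff = free_energy(n + 1, lam, m) - free_energy(n, lam, m)
approx = generalization_error(n, lam, m)

print(f"F(n+1) - F(n) = {diff:.6e}")
print(f"Eq. (1)       = {approx:.6e}")
```

The two values agree up to terms of higher order in $1/n$, which is the sense in which the coefficients of $F_X(n)$ determine those of $G(n)$.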



3 Formal Definitions of the Latent-Variable Estimation and its Accuracy

This section formulates the Bayes latent-variable estimation and an error function
that measures its accuracy.

We first consider a detailed definition of a latent variable. Let
$Y^n = \{y_1,\dots,y_n\}$ be unobservable data, which correspond to the latent parts
of the observable $X^n$. Then, the complete form of the data is $(x_i, y_i)$, and
$(X^n, Y^n)$ and $X^n$ are referred to as complete and incomplete data,
respectively. The true model generates the complete data $(X^n, Y^n)$, where the
range of the latent variables is $y_i \in \{1,\dots,K^*\}$. The learning model, on
the other hand, has the range $y_i \in \{1,\dots,K\}$. For a unified description, we
define that the true model has probabilities $q(y) = 0$ and $q(x,y) = 0$
for $y > K^*$.

We define the true parameter set for $(x, y)$ as

\[ W_{XY}^t = \{w^* ;\ p(x,y|w^*) = q(x,y)\}, \]

which is part of $W_X^t$. In Example 1,

\[ W_{XY}^t = \{a = 1, b_1 = b^*\}. \]

The subsets $\{a = 0, b_2 = b^*\}$ and $\{b_1 = b_2 = b^*\}$ in $W_X^t$ are excluded
since $W_{XY}^t$ takes account of the representation with respect to not only $x$
but also $y$. Due to the assumption $W_X^t \neq \emptyset$, $W_{XY}^t$ is not empty.
The set $W_{XY}^t$ again consists of an analytic set in the singular case, and it is
a unique point in the regular case.

While latent-variable estimation falls into various types according to the target
of the estimation, the present paper focuses on the Type-I estimation of [H]: the
joint probability of $(y_1,\dots,y_n)$ is the target and is written as
$p(Y^n|X^n)$. The Bayes estimation has two equivalent definitions:

\[ p(Y^n|X^n) = \int \prod_{i=1}^{n} \frac{p(x_i,y_i|w)}{p(x_i|w)}\,p(w|X^n)\,dw \tag{2} \]
\[ \phantom{p(Y^n|X^n)} = \frac{Z(X^n,Y^n)}{Z(X^n)}, \tag{3} \]

where the marginal likelihood for the complete data is given by

\[ Z(X^n,Y^n) = \int \prod_{i=1}^{n} p(x_i,y_i|w)\,\varphi(w;\eta)\,dw. \]

The true probability of $Y^n$ is uniquely given by

\[ q(Y^n|X^n) = \prod_{i=1}^{n} \frac{q(x_i,y_i)}{q(x_i)}. \tag{4} \]

The accuracy of the estimation is measured by the difference between $q(Y^n|X^n)$
and $p(Y^n|X^n)$. Thus, we define the error function as the average Kullback-Leibler
divergence,

\[ D(n) = \frac{1}{n} E_{XY}\left[ \ln \frac{q(Y^n|X^n)}{p(Y^n|X^n)} \right], \tag{5} \]

where the expectation is defined as

\[ E_{XY}[f(X^n,Y^n)] = \int \sum_{y_1=1}^{K^*} \cdots \sum_{y_n=1}^{K^*} f(X^n,Y^n)\,q(X^n,Y^n)\,dX^n. \]
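
For a very small model, the ratio form of the Bayes latent-variable estimation, $Z(X^n,Y^n)/Z(X^n)$, can be evaluated exactly. The following is a minimal sketch under simplifying assumptions that are not part of the paper's setting: a two-component mixture whose component densities are known and fixed, so that the only parameter is the mixing ratio $a$, with a symmetric Beta prior with hyperparameter `eta`. Under these assumptions $Z(X^n,Y^n)$ has a closed form through the Beta function, and $Z(X^n)$ is obtained by summing over all labelings, which is feasible only for small $n$. All names and data values are hypothetical.

```python
import itertools
import math

def log_beta(p, q):
    """ln B(p, q) computed from log-gamma functions."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

def log_z_complete(xs, ys, f, eta):
    """ln Z(X^n, Y^n) when only the mixing ratio is unknown (assumed setting)."""
    n1 = sum(1 for y in ys if y == 0)
    n2 = len(ys) - n1
    log_lik = sum(math.log(f[y](x)) for x, y in zip(xs, ys))
    return log_lik + log_beta(n1 + eta, n2 + eta) - log_beta(eta, eta)

def latent_posterior(xs, f, eta):
    """p(Y^n|X^n) = Z(X^n,Y^n) / sum over Y^n of Z(X^n,Y^n), by enumeration."""
    labelings = list(itertools.product((0, 1), repeat=len(xs)))
    logs = [log_z_complete(xs, ys, f, eta) for ys in labelings]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return {ys: w / total for ys, w in zip(labelings, weights)}

def gauss(mean):
    """Unit-variance Gaussian density with a fixed mean."""
    return lambda x: math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

f = (gauss(-2.0), gauss(2.0))  # fixed, known components (assumption)
xs = [-2.1, -1.8, 2.3]         # hypothetical observations
post = latent_posterior(xs, f, eta=1.0)
best = max(post, key=post.get)

print("most probable labeling:", best)
print("its probability:", round(post[best], 4))
```

The enumeration over $K^n$ labelings is what makes the exact computation impractical for realistic $n$, which is why sampling-based methods are used in practice.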






4 Asymptotic Analysis of the Error Function

In this section, we show that the algebraic geometrical analysis is applicable to the
Sing-LV estimation, and present the asymptotic form of the error function $D(n)$.
Let us define the zeta function on the complete data $(x, y)$ as

\[ \zeta_{XY}(z) = \int H_{XY}(w)^z\,\varphi(w;\eta)\,dw, \]

where the Kullback-Leibler divergence $H_{XY}(w)$ is given by

\[ H_{XY}(w) = \sum_{y=1}^{K^*} \int q(x,y) \ln \frac{q(x,y)}{p(x,y|w)}\,dx. \]

Let the largest pole of $\zeta_{XY}(z)$ be $z = -\lambda_{XY}$, and let its order be
$m_{XY}$. We consider the following conditions:

(A1) The divergence functions $H_{XY}(w)$ and $H_X(w)$ are analytic.

(A2) The prior distribution has a compact support, which includes $W_X^t$, and has
the expression $\varphi(w;\eta) = \psi_1(w;\eta)\psi_2(w;\eta)$, where
$\psi_1(w;\eta) > 0$ is a function of class $C^\infty$ and $\psi_2(w;\eta) \geq 0$
is analytic on the support of $\varphi(w;\eta)$.

The following theorem is the main result of the present paper:

Theorem 2 Let the true distribution of the latent variables and the estimated
distribution be defined by Eqs. (4) and (3), respectively. By assuming the
conditions (A1) and (A2), the asymptotic form of $D(n)$ is expressed as

\[ D(n) = (\lambda_{XY} - \lambda_X) \frac{\ln n}{n} - (m_{XY} - m_X) \frac{\ln\ln n}{n} + o\left(\frac{\ln\ln n}{n}\right). \]

Proof: Let us define another average free energy as

\[ F_{XY}(n) = -nS_{XY} + E_{XY}\left[ -\ln Z(X^n,Y^n) \right], \]

where the entropy function is given by

\[ S_{XY} = -\sum_{y=1}^{K^*} \int q(x,y) \ln q(x,y)\,dx. \]

According to the definitions of the error function $D(n)$ and the Bayes estimation
method Eq. (3), it holds that

\[ nD(n) = E_{XY}\left[ \ln \frac{q(X^n,Y^n)}{Z(X^n,Y^n)} \right] - E_X\left[ \ln \frac{q(X^n)}{Z(X^n)} \right] \]
\[ \phantom{nD(n)} = -nS_{XY} - E_{XY}\left[ \ln Z(X^n,Y^n) \right] + nS_X + E_X\left[ \ln Z(X^n) \right] \]
\[ \phantom{nD(n)} = F_{XY}(n) - F_X(n). \]

Based on (A1), (A2), and algebraic geometrical analysis, we obtain the asymptotic
forms of $F_{XY}(n)$ and $F_X(n)$:

\[ F_{XY}(n) = \lambda_{XY} \ln n - (m_{XY} - 1) \ln\ln n + O(1), \]
\[ F_X(n) = \lambda_X \ln n - (m_X - 1) \ln\ln n + O(1), \]

which proves the theorem. (End of Proof)
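
The identity used in the proof, $nD(n) = F_{XY}(n) - F_X(n)$, can be checked at the level of leading terms. The following is a minimal sketch; the function names and the pole positions and orders are hypothetical, and the $O(1)$ and $o(\cdot)$ remainders are dropped on both sides.

```python
import math

def average_free_energy(n, lam, m):
    """Leading terms lam*ln(n) - (m-1)*ln(ln(n)); the O(1) remainder is dropped."""
    return lam * math.log(n) - (m - 1) * math.log(math.log(n))

def latent_error(n, lam_xy, m_xy, lam_x, m_x):
    """Leading terms of D(n) in Theorem 2; the o(ln ln n / n) remainder is dropped."""
    log_n = math.log(n)
    return ((lam_xy - lam_x) * log_n - (m_xy - m_x) * math.log(log_n)) / n

n = 1000
lam_xy, m_xy = 1.5, 1  # hypothetical largest pole and order for the complete data
lam_x, m_x = 0.75, 2   # hypothetical largest pole and order for the incomplete data

lhs = n * latent_error(n, lam_xy, m_xy, lam_x, m_x)
rhs = average_free_energy(n, lam_xy, m_xy) - average_free_energy(n, lam_x, m_x)

print(f"n * D(n)        = {lhs:.6f}")
print(f"F_XY(n) - F_X(n) = {rhs:.6f}")
```

When the two poles coincide, the $\ln n$ term cancels and the order of the poles decides the next term, which is why the theorem keeps the $\ln\ln n / n$ contribution explicit.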



5 Accuracy of the Sing-LV Estimation in Mixture Models

In this section, we derive the asymptotic error of the Bayes latent-variable
estimation for mixture models. In Theorem 2, the possible dominant order was
calculated as $\ln n / n$. However, there is no guarantee that this is the actual
maximum order; if the zeta functions $\zeta_{XY}(z)$ and $\zeta_X(z)$ have their
largest poles in the same position and the orders of these poles are also the same,
the leading terms cancel and $nD(n)$ is reduced to a constant order. The result in
this section clearly shows that the dominant order is $\ln n / n$ in the mixture
models.

5.1 Model Settings

Assume that a learning model is a mixture of distributions described by

\[ p(x|w) = \sum_{k=1}^{K} a_k f(x|b_k), \tag{6} \]

where the components are defined by the distribution function $f$ of $x$, which is a
regular model. The parameter $w$ consists of the mixing ratios
$\{a_2,\dots,a_K\}$ and the component parameters $\{b_1,\dots,b_K\}$, and $d_c$ is
the dimension of the component parameters such that $b_k \in R^{d_c}$. The
mixing ratios have constraints $a_k \geq 0$ and $\sum_{k=1}^{K} a_k = 1$. We regard
$a_1$ as a function of the parameters, $a_1 = 1 - \sum_{k=2}^{K} a_k$.
The true model is expressed as

\[ q(x) = \sum_{k=1}^{K^*} a_k^* f(x|b_k^*), \tag{7} \]

where $a_k^*$ and $b_k^*$ are all constants. The number of components is assumed to
be $K^* < K$. The true model generates data $(X^n, Y^n)$, where
$Y^n = \{y_1,\dots,y_n\} \in \{1,\dots,K^*\}^n$ are the component labels, which are
estimated on the basis of incomplete data $X^n$.

A prior distribution is defined by

\[ \varphi(w;\eta) = \varphi(a;\eta_1)\,\varphi(b;\eta_2), \tag{8} \]
\[ \varphi(a;\eta_1) = \mathrm{Dir}_s(a;\eta_1), \tag{9} \]

where $a = \{a_1,\dots,a_K\}$, $b = \{b_1,\dots,b_K\}$, $\eta = \{\eta_1,\eta_2\}$,
and $\mathrm{Dir}_s$ stands for a symmetric Dirichlet distribution that includes the
common hyperparameter $\eta_1$ for all $a_k$. It is easily confirmed that it
satisfies the condition (A2).

5.2 Asymptotic Error of the Mixture Model 

The following theorem shows the dominant order of the error function for the mixture 
model. 

Theorem 3 Let the learning and the true models be mixtures defined by Eqs. (6) and
(7), respectively. Assume the conditions (A1) and (A2). The Bayes estimation for the
latent variables, Eq. (3), with the prior represented by Eqs. (8) and (9), has the
following bound for the asymptotic error:

\[ D(n) \geq \frac{(K - K^*)\,\eta_1}{2} \frac{\ln n}{n} + o\left(\frac{\ln n}{n}\right). \]

Due to the definition of the Dirichlet distribution, $\eta_1$ is positive. Combining
this with the assumption $K^* < K$, we obtain that the coefficient of $(\ln n)/n$ is
positive, which indicates that it is the dominant order.
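
The coefficient in Theorem 3 is the gap between the pole of Lemma 6 and the upper bound of Lemma 7, both stated in Section 5.3. The following is a minimal sketch that evaluates the three quantities with exact rational arithmetic; the function names and the example values of $K$, $K^*$, $d_c$, and $\eta_1$ are hypothetical.

```python
from fractions import Fraction

def lambda_xy(K, K_star, d_c, eta1):
    """Largest pole for the complete data (Lemma 6)."""
    return Fraction(K_star - 1 + K_star * d_c, 2) + (K - K_star) * eta1

def lambda_x_upper(K, K_star, d_c, eta1):
    """Upper bound on the largest pole for the incomplete data (Lemma 7)."""
    return Fraction(K_star - 1 + K_star * d_c, 2) + (K - K_star) * eta1 / 2

def error_coefficient_lower_bound(K, K_star, eta1):
    """Lower bound on the coefficient of ln(n)/n in Theorem 3."""
    return (K - K_star) * eta1 / 2

# Hypothetical setting: three learning components, one true component,
# one-dimensional component parameters, and eta_1 = 1/2.
K, K_star, d_c, eta1 = 3, 1, 1, Fraction(1, 2)

gap = lambda_xy(K, K_star, d_c, eta1) - lambda_x_upper(K, K_star, d_c, eta1)
bound = error_coefficient_lower_bound(K, K_star, eta1)

print("lambda_XY          =", lambda_xy(K, K_star, d_c, eta1))
print("bound on lambda_X  =", lambda_x_upper(K, K_star, d_c, eta1))
print("coefficient bound  =", bound)
```

The terms that depend on $K^*$ and $d_c$ cancel in the gap, so the bound depends only on the redundancy $K - K^*$ and the hyperparameter $\eta_1$.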

5.3 Proof of Theorem 3

Let us first introduce some useful lemmas for the zeta function. The proofs are
omitted because they are almost obvious due to the relation between the free energy
and the zeta function.

Lemma 4 Let the largest poles of the zeta functions
$\int H_1(w)^z \varphi(w)\,dw$ and $\int H_2(w)^z \varphi(w)\,dw$ be
$z = -\lambda_1$ and $z = -\lambda_2$, respectively. It holds that
$\lambda_1 \leq \lambda_2$ when $H_1(w) \leq H_2(w)$ on the support of
$\varphi(w)$.

Lemma 5 Under the same conditions as Lemma 4, it holds that
$\lambda_1 = \lambda_2$ if there exist positive constants $C_1$ and $C_2$ such that
$C_1 H_2(w) \leq H_1(w) \leq C_2 H_2(w)$.

We define an equivalence relation $H_1(w) \equiv H_2(w)$ due to
$\lambda_1 = \lambda_2$ in Lemma 5.
Second, the following lemma shows the calculation of $\lambda_{XY}$.

Lemma 6 The largest pole of the zeta function $\zeta_{XY}(z)$ is

\[ \lambda_{XY} = \frac{K^* - 1 + K^* d_c}{2} + (K - K^*)\,\eta_1, \]
\[ m_{XY} = 1. \]
Proof of Lemma 6: We consider a restricted parameter space $W_1$, which is a
neighborhood of $W_{XY}^t$ given by

\[ a_k = a_k^* \quad (2 \leq k \leq K^*), \]
\[ a_k = 0 \quad (k > K^*), \]
\[ b_k = b_k^* \quad (1 \leq k \leq K^*), \]

because poles of the zeta function do not depend on other areas. The
Kullback-Leibler divergence has the expression

\[ H_{XY}(w) = \sum_{k=1}^{K^*} a_k^* \left\{ \ln \frac{a_k^*}{a_k} + \int f(x|b_k^*) \ln \frac{f(x|b_k^*)}{f(x|b_k)}\,dx \right\}. \]

Based on the shift transformation $\Phi_1(w)$, such that

\[ \hat{a}_k = a_k - a_k^* \quad (2 \leq k \leq K^*), \]
\[ \hat{a}_k = a_k \quad (k > K^*), \]
\[ \hat{b}_{km} = b_{km} - b_{km}^* \quad (1 \leq k \leq K^*,\ 1 \leq m \leq d_c), \]
\[ \hat{b}_{km} = b_{km} \quad (k > K^*,\ 1 \leq m \leq d_c), \]

and the Taylor expansion of $\ln(1 + \Delta x)$ around $|\Delta x| = 0$, together
with $a_1 = 1 - \sum_{k=2}^{K} a_k$, we obtain

\[ H_{XY}(\Phi_1(w)) = \sum_{k=K^*+1}^{K} \hat{a}_k
   + \frac{1}{2a_1^*} \Big( \sum_{k=2}^{K^*} \hat{a}_k \Big)^2
   + \sum_{k=2}^{K^*} \frac{\hat{a}_k^2}{2a_k^*}
   + \sum_{k=1}^{K^*} a_k^* \int f(x|b_k^*) \ln \frac{f(x|b_k^*)}{f(x|b_k^* + \hat{b}_k)}\,dx + \cdots, \]

where higher order terms of $\hat{a}_k$ such as $\hat{a}_k^3$ for $k \leq K^*$ and
$\hat{a}_k^2$ for $k > K^*$ are omitted because they do not have an effect on the
position of the poles. Because $f(x|b_k)$ is regular,

\[ H_{XY}(\Phi_1(w)) \equiv \sum_{k=2}^{K^*} \hat{a}_k^2 + \sum_{k=K^*+1}^{K} \hat{a}_k + \sum_{k=1}^{K^*} \sum_{m=1}^{d_c} \hat{b}_{km}^2. \tag{10} \]

Let the right-hand side of Eq. (10) be $H_{XY}^1(\Phi_1(w))$, and consider a zeta
function given by

\[ \zeta_1(z) = \int H_{XY}^1(\Phi_1(w))^z\,\varphi(\Phi_1(w);\eta)\,d\Phi_1(w). \]

According to Lemma 5, the positions of the poles of $\zeta_1(z)$ are the same as
those of $\zeta_{XY}(z)$. By using a blow-up $\Phi_2$ defined by

\[ u_2 = \hat{a}_2, \]
\[ u_2 u_k = \hat{a}_k \quad (2 < k \leq K^*), \]
\[ u_2^2 u_k = \hat{a}_k \quad (k > K^*), \]
\[ u_2 v_{km} = \hat{b}_{km} \quad (1 \leq k \leq K^*,\ 1 \leq m \leq d_c), \]
\[ v_{km} = \hat{b}_{km} \quad (k > K^*,\ 1 \leq m \leq d_c), \]

we obtain the following expression in the restricted area,

\[ \zeta_1(z) = \int_{\Phi_2\Phi_1(W_1)} f_1(\Phi_2\Phi_1(w))\,|u_2|^{2z}\,\varphi(\Phi_2\Phi_1(w);\eta)\,|u_2|^{K^* - 2 + K^* d_c + 2(K - K^*)}\,d\Phi_2\Phi_1(w) + \cdots, \]

where $f_1$ is a function consisting of the parameters except for $u_2$, and a
factor on $|u_2|$ is derived from the Jacobian of $\Phi_2$. The symmetric Dirichlet
prior has a factor $\prod_{k=2}^{K} a_k^{\eta_1 - 1}$ in the original parameter
space. According to $\Phi_2\Phi_1(a_k) = u_2^2 u_k$ for $k > K^*$, it has a factor
$u_2^{2(K - K^*)(\eta_1 - 1)}$ in the space of $\Phi_2\Phi_1(w)$, which indicates
that $\zeta_1(z)$ has a pole at
$z = -(K^* - 1 + K^* d_c)/2 - (K - K^*)\eta_1$. Considering the symmetry
of the parameters in $H_{XY}^1(w)$, we determine that this pole is the largest and
that its order is $m_{XY} = 1$, which proves Lemma 6. (End of Proof)

The result for $\lambda_X$ is shown in the following lemma.

Lemma 7 The largest pole of the zeta function $\zeta_X(z)$ has the bound

\[ \lambda_X \leq \mu = \frac{K^* - 1 + K^* d_c}{2} + \frac{(K - K^*)\,\eta_1}{2}. \]



Proof of Lemma It is known [TJ] ; Section 7.8 of [32] that, in the restricted 
area Wi, there are positive constants C\ and C[ such that 



q(x) 

C[ I \ p(x\w) — q(x) > dx. 



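Using $\Phi_1$, we obtain

\[ H_X(\Phi_1(w)) \leq C_1' \int \Big\{ \sum_{k=2}^{K^*} \hat{a}_k \big( f(x|\hat{b}_k + b_k^*) - f(x|\hat{b}_1 + b_1^*) \big)
   + \sum_{k=2}^{K^*} a_k^* \big( f(x|\hat{b}_k + b_k^*) - f(x|b_k^*) \big) \]
\[ \qquad + \Big( 1 - \sum_{k=2}^{K^*} a_k^* \Big) \big( f(x|\hat{b}_1 + b_1^*) - f(x|b_1^*) \big)
   + \sum_{k>K^*} \hat{a}_k \big( f(x|\hat{b}_k) - f(x|\hat{b}_1 + b_1^*) \big) \Big\}^2 dx. \]

Because $f(x|b_k)$ is a regular model, the Taylor expansion at $b_k^*$ yields

\[ f(x|\hat{b}_k + b_k^*) - f(x|b_k^*) = \hat{b}_k^{\top} \frac{\partial}{\partial b_k} f(x|b_k^*) + \cdots. \]

Then, in $W_1$, there is a positive constant $C_2$ such that

\[ H_X(\Phi_1(w)) \leq C_2 \Big\{ \sum_{k=2}^{K^*} \hat{a}_k^2 + \sum_{k=2}^{K^*} \sum_{m=1}^{d_c} \hat{b}_{km}^2 + \sum_{m=1}^{d_c} \hat{b}_{1m}^2 + \sum_{k>K^*} \hat{a}_k^2 \Big\}. \]

Let the right-hand side be $H_X^2(\Phi_1(w))$, and consider a zeta function given by

\[ \zeta_2(z) = \int H_X^2(\Phi_1(w))^z\,\varphi(\Phi_1(w);\eta)\,d\Phi_1(w). \]

According to Lemma 4, a pole $z = -\mu$ of the zeta function $\zeta_2(z)$ provides a
bound for the largest pole of $\zeta_X(z)$, such that $z = -\lambda_X \geq -\mu$. By
using a blow-up $\Phi_3$ defined by

\[ u_2 = \hat{a}_2, \]
\[ u_2 u_k = \hat{a}_k \quad (2 < k \leq K^*), \]
\[ u_2 u_k = \hat{a}_k \quad (k > K^*), \]
\[ u_2 v_{km} = \hat{b}_{km} \quad (1 \leq k \leq K^*,\ 1 \leq m \leq d_c), \]
\[ v_{km} = \hat{b}_{km} \quad (k > K^*,\ 1 \leq m \leq d_c), \]

we obtain

\[ \zeta_2(z) = \int_{\Phi_3\Phi_1(W_1)} f_2(\Phi_3\Phi_1(w))\,|u_2|^{2z}\,\varphi(\Phi_3\Phi_1(w);\eta)\,|u_2|^{K - 2 + K^* d_c}\,d\Phi_3\Phi_1(w) + \cdots, \]

where $f_2$ is a function of the parameters except for $u_2$, and the factor on
$|u_2|$ is derived from the Jacobian of $\Phi_3$. It is easy to confirm that the
Dirichlet prior has a factor $u_2^{(K - K^*)(\eta_1 - 1)}$. Therefore, $\zeta_2(z)$
has a pole at $z = -\mu = -(K^* - 1 + K^* d_c)/2 - (K - K^*)\eta_1/2$, which proves
Lemma 7. (End of Proof)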

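We are now prepared to prove Theorem 3.

Proof of Theorem 3: According to Theorem 2, it holds that

\[ D(n) = (\lambda_{XY} - \lambda_X) \frac{\ln n}{n} + o\left(\frac{\ln n}{n}\right). \]

Combining Lemmas 6 and 7, we obtain

\[ D(n) \geq \left\{ \frac{K^* - 1 + K^* d_c}{2} + (K - K^*)\,\eta_1 - \frac{K^* - 1 + K^* d_c}{2} - \frac{(K - K^*)\,\eta_1}{2} \right\} \frac{\ln n}{n} + o\left(\frac{\ln n}{n}\right) \]
\[ \phantom{D(n)} = \frac{(K - K^*)\,\eta_1}{2} \frac{\ln n}{n} + o\left(\frac{\ln n}{n}\right), \]

which completes the proof. (End of Proof)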

According to Section 7.8 of [13], the Gaussian mixture does not satisfy the condi- 
tion (Al). More precisely, Hx{w) is not analytic in the case where the components 
are Gaussian. Even if the model does not directly satisfy the condition, the following 
relation is found in the area W±, 



g(ar) 

where $C_1$ is a positive constant. Therefore, the theorem also holds in the Gaussian
mixture.

6 Discussion

6.1 Comparison of the Dominant Order of the Accuracy 

We investigated the asymptotic form of the error function. In the Reg-LV estimation 
such that K = K*, 



where w* is the unique point consisting of W l XY [E]. The dominant order is 1/n, 
and the coefficient is determined by the Fisher information matrices on $p(x,y|w)$
and $p(x|w)$. Theorem 2 implies that the largest possible order is $\ln n / n$ in the
Sing-LV estimation, and Theorem 3 ensures that the order is the dominant one in
many mixture models. This order change is adverse for the performance because
the error converges more slowly to zero. In singular cases, the probability
$p(Y^n|X^n)$ is constructed over the space $Y^n \in \{1,\dots,K\}^n$, while the true
probability $q(Y^n|X^n)$ is over $Y^n \in \{1,\dots,K^*\}^n$. The size of the
redundant space, $K^n - K^{*n}$, grows exponentially with the amount of training
data. For realizing $p(x,y|w^*)$, where $w^* \in W_{XY}^t$, we must assign zero to
the probabilities on the vast redundant space. The increased order reflects the cost
of assigning these values.

The Dirichlet prior distribution for the mixing ratio is qualitatively known to
control the number of available components, an effect referred to as automatic
relevance determination (ARD); a small hyperparameter tends to yield a result
with few components due to the shape of the distribution. Theorem 3 quantitatively
shows an effect of the Dirichlet prior. The lower bound in the theorem
mathematically supports the ARD effect; the redundancy $K - K^*$ and the
hyperparameter $\eta_1$ have a linear influence on the accuracy.

Let us compare the dominant order of $D(n)$ with that of the generalization error.
We find that both Reg-OV and Sing-OV estimations have the same dominant order
$1/n$, as shown in Eq. (1), while the redundancy and the hyperparameter affect the
coefficients. Thus, changing the order is a unique phenomenon of the latent-variable
estimation.

6.2 Effect of a Symmetric Parameter Space on the Accuracy

We now consider whether the symmetry of the parameter affects the accuracy. In
latent-variable models, both the latent variable and the parameter are symmetric,
which makes it difficult to interpret the estimation results. This is known as the
label-switching problem. The definition of $D(n)$ is precise for the latent
variable; for example, the $k$th component $a_k^* f(x|b_k^*)$ of the true model must
be estimated by the same ordered component $a_k f(x|b_k)$ of the learning model to
realize $D(n) = 0$ in the mixture model. This eliminates the symmetry of the latent
variable. We can thus compare the error function without the parameter symmetry to
$D(n)$.
Consider the ideal estimation

\[ p_{W_1}(Y^n|X^n) = \frac{Z_{W_1}(X^n,Y^n)}{\sum_{Y^n} Z_{W_1}(X^n,Y^n)}, \]
\[ Z_{W_1}(X^n,Y^n) = \int_{W_1} \prod_{i=1}^{n} p(x_i,y_i|w)\,\varphi(w;\eta)\,dw, \]

where the estimation is constructed on the basis of only the proper area. The
accuracy of this estimation must be better than the original $p(Y^n|X^n)$. The error
function is given by

\[ D_r(n) = \frac{1}{n} E_{XY}\left[ \ln \frac{q(Y^n|X^n)}{p_{W_1}(Y^n|X^n)} \right]
          = \frac{1}{n} \left\{ F_{XY}^r(n) - F_X^r(n) \right\}, \]

where

\[ F_{XY}^r(n) = -nS_{XY} + E_{XY}\left[ -\ln Z_{W_1}(X^n,Y^n) \right], \]
\[ F_X^r(n) = -nS_X + E_X\left[ -\ln \sum_{Y^n} Z_{W_1}(X^n,Y^n) \right]. \]

According to the definition of $D_r(n)$, there is no symmetry of either the latent
variables or the parameters. As seen in the proofs of Lemmas 6 and 7, the essential
poles providing the coefficients $\lambda_{XY}$ and $\lambda_X$ are calculated from
the restricted area $W_1$. We can thus conclude that $F_{XY}^r(n)$ and $F_X^r(n)$
have the same asymptotic forms as $F_{XY}(n)$ and $F_X(n)$, respectively. Therefore,
$D_r(n)$ has the same asymptotic form as $D(n)$; the symmetry property does not
asymptotically affect the accuracy.

Figure 1: The true parameter set $W_X^t$ (the left panel), and the parameter areas
$W_1$, $W_2$, and $W_3$ (the right panel)

6.3 Phase Transition and the ARD Effect of the Hyperparameter

Let us consider the ARD effect more carefully. For convenience, we will use the
model settings from Example 1. Let the parameter areas $W_1$, $W_2$, and $W_3$ be
the neighborhoods of $\{a = 1, b_1 = b^*\}$, $\{b_1 = b_2 = b^*\}$, and
$\{a = 0, b_2 = b^*\}$, respectively (Figure 1). The ARD effect results in the
posterior distribution having a peak around $W_1$ or $W_3$ because the redundant
component is not eliminated in $W_2$. Then, we can determine whether the effect has
appeared based on the position of the peak.

The asymptotic form of the free energy provides the structure of the posterior
distribution. More precisely, the posterior $p(w|X^n)$ will have the maximum peak
in the effective parameter area, that is, the area from which the coefficient
$\lambda_X$ is derived. The free energy $F(X^n)$ has an asymptotic form similar to
the average energy $F_X(n)$ [13, Main Formula II],

\[ F(X^n) = nS(X^n) + \lambda_X \ln n - (m_X - 1) \ln\ln n + O_p(1), \]

where $S(X^n) = -\frac{1}{n} \sum_{i=1}^{n} \ln q(x_i)$. According to
$Z(X^n) = \exp(-F(X^n))$, the posterior distribution has the expression

\[ p(w|X^n) = \frac{\prod_{i=1}^{n} p(x_i|w)\,\varphi(w;\eta)}{\exp\{-nS(X^n) - \lambda_X \ln n + o_p(\ln n)\}}. \]

Let us divide the neighborhood of $W_X^t$ into $W_e \cup W_o$, where $W_e$ is the
effective parameter region for the calculation of $\lambda_X$. A posterior value of
the other region $W_o$ is described by

\[ p(W_o|X^n) = \int_{W_o} p(w|X^n)\,dw
   = \frac{\int_{W_o} \prod_{i=1}^{n} p(x_i|w)\,\varphi(w;\eta)\,dw}{\exp\{-nS(X^n) - \lambda_X \ln n + o_p(\ln n)\}} \]
\[ \phantom{p(W_o|X^n)} = \frac{\exp\{-nS(X^n) - \mu_X \ln n + o_p(\ln n)\}}{\exp\{-nS(X^n) - \lambda_X \ln n + o_p(\ln n)\}}, \]

where $\mu_X > \lambda_X$. Note that $W_o$ would be the effective area if
$\mu_X < \lambda_X$ holds. Then, we can find that the posterior asymptotically has
zero value in $W_o$, which means that the peak exists in the effective area.

As shown in the following lemma, the effective area of the posterior depends on
the hyperparameter when the prior of the mixing ratio parameters is the Dirichlet
distribution. Therefore, the ARD effect is quantitatively observed when the
coefficient is exactly calculated. Theorem 3 focused on the nonnegativity of the
leading-term coefficient in the error function. Restricting the model to the binomial
mixture, where the exact form of $\lambda_X$ is studied [15], we now explain how the
hyperparameter works on the posterior.

The component of the binomial mixture model for $x \in \{1, \ldots, M\}$ is expressed
as

\[ f(x = m|b) = \binom{M}{m} b^m (1 - b)^{M - m}, \]

where $M$ is an integer such that $K < M$, and $\binom{M}{m}$ is the number of combinations
of $M$ elements taken $m$ at a time. Then, $\lambda_X$ can be rigorously derived.

Lemma 8 Suppose that $K = 2$, $K^* = 1$, and $M > 2$, where the true and the
learning models are given by

\[
\begin{aligned}
q(x) &= f(x|b^*), \\
p(x|w) &= a f(x|b_1) + (1 - a) f(x|b_2),
\end{aligned}
\]
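respectively. Then the largest pole of the zeta function $\zeta_X(z)$ is $z = -\lambda_X$ with order $m_X$, where

\[
\lambda_X = \begin{cases} (1 + \eta_1)/2 & \eta_1 < 1/2, \\ 3/4 & \eta_1 \ge 1/2, \end{cases}
\qquad
m_X = \begin{cases} 2 & \eta_1 = 1/2, \\ 1 & \text{otherwise}. \end{cases}
\]

Proof: Define the shift transformation $\Phi_4$ given by

\[
\begin{aligned}
\bar{a} &= 1 - a, \\
\bar{b}_1 &= (1 - \bar{a})(b_1 - b^*) + \bar{a}\bar{b}_2, \\
\bar{b}_2 &= b_2 - b^*.
\end{aligned}
\]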

This corresponds to focusing on the area $W_1 \cup W_2$. Following the calculation of [To],
we obtain

\[ H_X(\Phi_4(w)) = \bar{b}_1^2 + \bar{a}^2 \bar{b}_2^4. \]

Let the right-hand side be $H_{X2}(w)$, and consider a zeta function given by

\[ \zeta_3(z) = \int_{W_1 \cup W_2} H_{X2}(\Phi_4(w))^z \varphi(\Phi_4(w);\eta)\, d\Phi_4(w). \]

By using a blow-up $\Phi_5$ defined by

\[
\begin{aligned}
\bar{a} &= v_1 v_2, \\
\bar{b}_1 &= u_1^2 v_1, \\
\bar{b}_2 &= u_1,
\end{aligned}
\]

we obtain the following expression,

\[ \zeta_3(z) = \int_{\Phi_5\Phi_4(W_1 \cup W_2)} f_3(\Phi_5\Phi_4(w))\, u_1^{4z} v_1^{2z}\, \varphi(\Phi_5\Phi_4(w);\eta)\, |u_1^2 v_1|\, d\Phi_5\Phi_4(w), \]

where $f_3$ is a function of the parameter $v_2$. The prior has a factor $v_1^{\eta_1 - 1}$. Therefore,
$\zeta_3(z)$ has poles at $z = -3/4$ and $z = -(1 + \eta_1)/2$, which are calculated from the
factors $u_1$ and $v_1$, respectively. Considering the cases $u_1 = 0$ and $v_1 = 0$, we find
that the effective area of the pole $z = -3/4$ is $W_2$ and that of $z = -(1 + \eta_1)/2$ is
$W_1$. Due to the symmetry, the area $W_2 \cup W_3$ has the same poles. Then, the largest
pole changes at $\eta_1 = 1/2$, where the order of the pole is $m_X = 2$. This completes
the proof. (End of Proof)

This lemma shows that the free energy and the peak of the posterior both change
at $\eta_1 = 1/2$. A switch in the underlying function of the free energy is generally
referred to as a phase transition. The ARD effect appears in the phase $\eta_1 < 1/2$
since the posterior asymptotically has its support in $W_1$ or $W_3$.
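The phase transition of Lemma 8 can be tabulated with a few lines of code. The sketch below compares the two candidate poles named in the proof, $z = -3/4$ and $z = -(1 + \eta_1)/2$, and reports which one is closer to zero. The function name and the returned labels are hypothetical, and exact rational arithmetic is used only so that the boundary case $\eta_1 = 1/2$ is detected reliably.

```python
from fractions import Fraction


def lemma8_phase(eta1):
    """Return (lambda_X, m_X, effective area) for the setting of Lemma 8.

    The largest pole is the candidate closest to zero, so lambda_X is
    the smaller of the two magnitudes. eta1 may be an int, a Fraction,
    or a string such as "1/4".
    """
    eta1 = Fraction(eta1)
    lam_w1 = (1 + eta1) / 2   # pole z = -(1 + eta1)/2, effective in W_1 (W_3 by symmetry)
    lam_w2 = Fraction(3, 4)   # pole z = -3/4, effective in W_2
    if lam_w1 < lam_w2:
        return lam_w1, 1, "W_1 or W_3"
    if lam_w1 > lam_w2:
        return lam_w2, 1, "W_2"
    return lam_w2, 2, "boundary"  # both poles coincide, order 2


if __name__ == "__main__":
    for eta1 in ("1/4", "1/2", "1", "2"):
        print(eta1, lemma8_phase(eta1))
```

For $\eta_1 < 1/2$ the coefficient grows linearly in $\eta_1$ and the peak sits where the redundant component is eliminated; beyond $1/2$ it stays at $3/4$ and the peak moves to $W_2$.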

General mixture models also have a phase transition. 

Theorem 9 Under the same model settings as Theorem^ the average free energy
$F_X(n)$ has at least two phases: the phase that eliminates all redundant components
when $\eta_1$ is small, and the one that uses them when $\eta_1$ is sufficiently large.

The proof is in the Appendix.

6.4 Behavior of Sampling Latent Variables

Let us discuss the behavior of the Sing-LV estimation in the mixture models in
practice. According to Eq. 3, MCMC sampling of the $Y^n$'s following $p(Y^n|X^n)$
is essential for the Bayes estimation. The following relation indicates that we do not
need to calculate $Z(X^n)$ and that the value of $Z(X^n, Y^n)$ determines the properties
of the estimation:

\[ p(X^n, Y^n) = Z(X^n, Y^n) \propto p(Y^n|X^n) = \frac{Z(X^n, Y^n)}{Z(X^n)}. \]

The expression of $p(X^n, Y^n)$ becomes tractable with a conjugate prior, which allows the
parameter to be marginalized out analytically [5] [6].
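As an illustration of this tractability, the following sketch evaluates $\ln Z(X^n, Y^n)$ in closed form for a binomial mixture. It assumes a symmetric Dirichlet prior with hyperparameter `eta1` on the mixing ratios and, as an assumption of this sketch only, an independent Beta(`beta_a`, `beta_b`) prior on each $b_k$. The function names, the zero-based labels, and the data range $0, \ldots, M$ are hypothetical choices.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_joint_marginal(xs, ys, K, M, eta1=1.0, beta_a=1.0, beta_b=1.0):
    """ln Z(X^n, Y^n) with the parameter w = (a, b) integrated out.

    xs: observations in {0, ..., M}; ys: labels in {0, ..., K-1}.
    """
    n = len(xs)
    counts = [0] * K   # number of data assigned to each component
    sums = [0] * K     # sum of the observations in each component
    log_binom = 0.0    # sum of the log binomial coefficients
    for x, y in zip(xs, ys):
        counts[y] += 1
        sums[y] += x
        log_binom += (math.lgamma(M + 1) - math.lgamma(x + 1)
                      - math.lgamma(M - x + 1))
    # Dirichlet-multinomial term from integrating out the mixing ratios.
    log_z = (math.lgamma(K * eta1) - K * math.lgamma(eta1)
             - math.lgamma(n + K * eta1))
    log_z += sum(math.lgamma(c + eta1) for c in counts)
    # Beta-binomial term from integrating out each b_k.
    for c, s in zip(counts, sums):
        log_z += log_beta(beta_a + s, beta_b + c * M - s) - log_beta(beta_a, beta_b)
    return log_z + log_binom


if __name__ == "__main__":
    xs = [3, 4, 3, 5, 4, 3, 4, 4]
    one_label = [0] * len(xs)
    two_labels = [0, 1] * (len(xs) // 2)
    # F(X^n, Y^n) = -ln Z(X^n, Y^n) for two candidate label assignments.
    print(-log_joint_marginal(xs, one_label, K=2, M=8))
    print(-log_joint_marginal(xs, two_labels, K=2, M=8))
```

Because $Z(X^n)$ is common to all label assignments, comparing these values across assignments is enough to rank them under $p(Y^n|X^n)$.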

We determine where the estimated distribution $p(Y^n|X^n)$ has its peak by using
the relation $F(X^n, Y^n) = -\ln Z(X^n, Y^n)$. Obviously, the label assignment $Y^n$
minimizing $F(X^n, Y^n)$ provides the peak. We describe this assignment as $\bar{Y}^n$. For
the same reason as for $F(X^n)$ in the previous subsection, we obtain the following
asymptotic form of $F(X^n, Y^n)$:

\[
\begin{aligned}
F(X^n, Y^n) &= nS'(X^n, Y^n) + \lambda_{XY} \ln n + O_p(1), \\
S'(X^n, Y^n) &= -\frac{1}{n}\sum_{i=1}^n \ln p(x_i, y_i|w^*), \\
w^* &\in W_{XY} = \bigcup_{\sigma \in S} \{w;\ a_{\sigma(k)} = a^*_k,\ b_{\sigma(k)} = b^*_k \text{ for } 1 \le k \le K^*\},
\end{aligned}
\]

where $S$ is the set of injective functions from $\{1, \ldots, K^*\}$ to $\{1, \ldots, K\}$. The leading
term of the asymptotic $F(X^n, Y^n)$ is originally defined as $nS(X^n, Y^n)$, where

\[ S(X^n, Y^n) = -\frac{1}{n}\sum_{i=1}^n \ln q(x_i, y_i). \]

It is easily confirmed that $S'(X^n, Y^n) = S(X^n, Y^n)$. In other words, the product
$\prod_{i=1}^n p(x_i, y_i|w)$ with $w \in W_X \setminus W_{XY}$ cannot realize $S(X^n, Y^n)$, which means
that the redundant labels are eliminated in $\bar{Y}^n$.

It is necessary to emphasize that the calculation of $p(Y^n|X^n)$ based on sampling
from $p(w|X^n)$ following Eq. 2 can be inaccurate. According to Theorem 9, when
$\eta_1$ is large, the posterior has its peak in the area in which all the components are
used. On the other hand, $p(Y^n|X^n)$ has its peak in the area that corresponds to
the phase that eliminates the redundant components for any $\eta_1$. In such a case, the
Monte Carlo integration in Eq. 2 is not reliable in practice since the value of
$R = \prod_{i=1}^n p(x_i, y_i|w)/p(x_i|w)$ is unstable. For example, let us consider this inaccuracy
under the model settings of Example [TJ. As Lemma 8 shows, $p(w|X^n)$ generates
samples mainly in $W_2$ for $\eta_1 > 1/2$. However, $R$ has valid values in $W_1$ and $W_3$, and
it is close to zero in other areas. Therefore, the value of $p(Y^n|X^n)$ calculated by Eq.
2 is not appropriate.
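The instability of $R$ can be seen in a small numerical sketch for the two-component binomial mixture. The code below evaluates $\ln R$ for a label assignment that uses only the first component, once at a parameter near $W_1$ and once at a parameter in $W_2$. The function names, the data, and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def log_binom_pmf(x, M, b):
    """Log of the binomial probability of x successes in M trials."""
    return (math.lgamma(M + 1) - math.lgamma(x + 1) - math.lgamma(M - x + 1)
            + x * math.log(b) + (M - x) * math.log(1.0 - b))


def log_R(xs, ys, a, b1, b2, M):
    """ln R = sum_i [ln p(x_i, y_i | w) - ln p(x_i | w)] for K = 2.

    Labels in ys are 0 (first component) or 1 (second component);
    a must lie strictly between 0 and 1.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        l1 = math.log(a) + log_binom_pmf(x, M, b1)
        l2 = math.log(1.0 - a) + log_binom_pmf(x, M, b2)
        top = max(l1, l2)
        # log-sum-exp for ln p(x_i | w)
        log_px = top + math.log(math.exp(l1 - top) + math.exp(l2 - top))
        total += (l1 if y == 0 else l2) - log_px
    return total


if __name__ == "__main__":
    xs = [4, 3, 5, 4, 4, 3, 5, 4] * 25   # n = 200 hypothetical observations
    ys = [0] * len(xs)                   # the redundant label is never used
    print("near W_1:", log_R(xs, ys, a=0.999, b1=0.5, b2=0.5, M=8))
    print("in W_2:  ", log_R(xs, ys, a=0.5, b1=0.5, b2=0.5, M=8))
```

With $b_1 = b_2$ the value reduces to $n \ln a$, so $R$ decays like $2^{-n}$ in $W_2$ while staying of order one near $W_1$; samples concentrated in $W_2$ therefore contribute almost nothing to the Monte Carlo average.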

Next, we investigate sampling from $p(w, Y^n|X^n)$. Its property indicates that
the MCMC method using the samples from $p(w, Y^n|X^n)$ cannot asymptotically
construct the posterior $p(w|X^n)$. The Gibbs sampling in the MCMC method jH] is
one of the representative techniques.

[Gibbs Sampling for a Model with a Latent Variable] 

1. Initialize the parameter; 

2. Sample $Y^n$ based on $p(Y^n|w, X^n)$;

3. Sample $w$ based on $p(w|Y^n, X^n)$;

4. Iterate by alternately updating Step 2 and Step 3. 
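The four steps above can be sketched for a binomial mixture as follows. This is a minimal illustration, not the experimental setup of any cited work: it assumes a symmetric Dirichlet prior with hyperparameter `eta1` on the mixing ratios and an independent Beta(`beta_a`, `beta_b`) prior on each $b_k$, and the function name, argument names, and zero-based labels are hypothetical.

```python
import random


def gibbs_binomial_mixture(xs, K, M, eta1=1.0, beta_a=1.0, beta_b=1.0,
                           n_iter=200, seed=0):
    """Gibbs sampling of (w, Y^n) for a K-component binomial mixture.

    Returns a list of (a, b, ys) triples, one per iteration.
    """
    rng = random.Random(seed)
    # Step 1: initialize the parameter w = (a, b).
    a = [1.0 / K] * K
    b = [rng.random() for _ in range(K)]
    samples = []
    for _ in range(n_iter):
        # Step 2: sample Y^n from p(Y^n | w, X^n).
        # The binomial coefficient is common to all components and cancels.
        ys = []
        for x in xs:
            weights = [a[k] * b[k] ** x * (1.0 - b[k]) ** (M - x)
                       for k in range(K)]
            ys.append(rng.choices(range(K), weights=weights)[0])
        # Step 3: sample w from p(w | Y^n, X^n) using conjugacy.
        counts = [ys.count(k) for k in range(K)]
        sums = [sum(x for x, y in zip(xs, ys) if y == k) for k in range(K)]
        gammas = [rng.gammavariate(eta1 + counts[k], 1.0) for k in range(K)]
        total = sum(gammas)
        a = [g / total for g in gammas]   # Dirichlet draw via normalized gammas
        b = [rng.betavariate(beta_a + sums[k], beta_b + counts[k] * M - sums[k])
             for k in range(K)]
        # Step 4: the loop alternates Steps 2 and 3.
        samples.append((list(a), list(b), ys))
    return samples


if __name__ == "__main__":
    xs = [4, 3, 5, 4, 4, 3, 5, 4] * 10   # hypothetical data from one component
    draws = gibbs_binomial_mixture(xs, K=2, M=8, eta1=2.0, n_iter=100)
    print([round(v, 3) for v in draws[-1][0]])   # last sampled mixing ratios
```

Discarding the `ys` entry of each triple gives the parameter sequence $\{w\}$ discussed next.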

The sequence of $\{Y^n, w\}$ obtained by this algorithm follows $p(w, Y^n|X^n)$. Ignoring
$Y^n$, we obtain the sequence $\{w\}$. This parameter sequence is assumed to consist of
samples from the posterior because $p_G(w|X^n) = \sum_{Y^n} p(w, Y^n|X^n)$ is theoretically
equal to $p(w|X^n)$. However, the practical value of $p_G(w|X^n)$ based on the Monte
Carlo method can be different from $p(w|X^n)$ when $\eta_1$ is large. Let us consider the
expression

\[
\begin{aligned}
-\ln p(X^n, Y^n, w) &= -\ln \prod_{i=1}^n \prod_{k=1}^K \{a_k f(x_i|b_k)\}^{\delta_{y_i,k}} - \ln \varphi(w;\eta) \\
&= -n \sum_{k=1}^K \frac{Y_k}{n} \ln a_k - \sum_{i=1}^n \sum_{k=1}^K \delta_{y_i,k} \ln f(x_i|b_k) - \ln \varphi(w;\eta),
\end{aligned}
\]

where $Y_k = \sum_{i=1}^n \delta_{y_i,k}$ is the number of data assigned to the $k$th component.

We wish to find a pair $(w, Y^n)$ that minimizes this expression in the asymptotic
case $n \to \infty$. The third term does not have any asymptotic effect because it is of
constant order in $n$. The first term shows that $Y_k/n$ should be equal to $a_k$. According to
the second term, $w \in W_X$ is necessary. Considering $S_a(w) = -\sum_{k=1}^K a_k \ln a_k$ for
$w \in W_X$, which corresponds to the convergent value of the first term, we obtain that
$w \in W_{XY} \subset W_X$ minimizes $S_a(w)$. Therefore, $p_G(w|X^n)$ has its peak at $w \in W_{XY}$ for
any $\eta_1$ while the structure of the original $p(w|X^n)$ depends on the phases of $F(X^n)$
controlled by $\eta_1$. As shown in Lemma 8, the algebraic geometrical analysis provides
the transition point from the phase that eliminates all redundant components. This
point determines the area of $\eta_1$ that satisfies $p_G(w|X^n) = p(w|X^n)$. This property
of the Gibbs sampling has been reported in a Gaussian mixture model [7]. The
experimental results show that the obtained sequence of $\{w\}$ is localized in the
area corresponding to $W_{XY}$.

7 Conclusions

The present paper clarifies the asymptotic accuracy of the Bayes latent-variable 
estimation. The dominant order is at most $\ln n / n$, and its coefficient is determined
by a positional relation between the largest poles of the zeta functions. According to 
the mixture-model case, it is suggested that the order is dominant and the coefficient 
is affected by the redundancy of the learning model and the hyperparameters. The 
accuracy of prediction can be approximated by methods such as the cross-validation 
and bootstrap methods. On the other hand, there is no approximation for the 
accuracy of latent-variable estimation, which indicates that the theoretical result 
plays a central role in evaluating the model and the estimation method.

Appendix

In this section, we prove Theorem 9.

First, we introduce tighter upper bounds on $\lambda_X$.

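Lemma 10 Under the same condition as in Theorem^ it holds that

\[
\lambda_X \le \begin{cases}
\dfrac{K^* - 1 + K^* d_c}{2} + \dfrac{(K - K^*)\eta_1}{2} & \eta_1 < d_c, \\[2mm]
\dfrac{K^* - 1 + K^* d_c}{2} + \dfrac{(K - K^*)d_c}{2} & \eta_1 \ge d_c.
\end{cases}
\]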



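Proof: Consider the area $W_{21}$, which is the neighborhood of

\[
\begin{aligned}
a_k &= a^*_k & &(2 \le k \le K^*), \\
b_{km} &= b^*_{km} & &(1 \le k \le K^*,\ 1 \le m \le d_c), \\
b_{km} &= b^*_{1m} & &(k > K^*,\ 1 \le m \le d_c).
\end{aligned}
\]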



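Let us define the shift transformation $\Phi_6$ given by

\[
\begin{aligned}
\bar{a}_k &= a_k - a^*_k & &(2 \le k \le K^*), \\
\bar{b}_{km} &= b_{km} - b^*_{km} & &(1 \le k \le K^*,\ 1 \le m \le d_c), \\
\bar{b}_{km} &= b_{km} - b^*_{1m} & &(k > K^*,\ 1 \le m \le d_c).
\end{aligned}
\]

Based on the Taylor expansion of $f(x|\bar{b}_k + b^*_k)$, there is a positive constant $C_3$ such
that

\[ H_X(\Phi_6(w)) \le C_3 \left\{ \sum_{k=2}^{K^*} \bar{a}_k^2 + \sum_{k=1}^{K} \sum_{m=1}^{d_c} \bar{b}_{km}^2 \right\}. \]

Let the right-hand side be $H_{X3}(w)$, and consider a zeta function given by

\[ \zeta_4(z) = \int_{W_{21}} H_{X3}(\Phi_6(w))^z \varphi(\Phi_6(w);\eta)\, d\Phi_6(w). \]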

By using a blow-up $\Phi_7$ defined by

\[
\begin{aligned}
u_2 &= \bar{a}_2, \\
u_2 u_k &= \bar{a}_k & &(2 < k \le K^*), \\
u_k &= a_k & &(k > K^*), \\
u_2 v_{km} &= \bar{b}_{km} & &(1 \le k \le K,\ 1 \le m \le d_c),
\end{aligned}
\]

we obtain the following expression:

\[ \zeta_4(z) = \int_{\Phi_7\Phi_6(W_{21})} f_4(\Phi_7\Phi_6(w))\, u_2^{2z}\, \varphi(\Phi_7\Phi_6(w);\eta)\, |u_2|^{K^* - 2 + K d_c}\, d\Phi_7\Phi_6(w), \]

where $f_4$ is a function consisting of the parameters except for $u_2$. Therefore, $\zeta_4(z)$
has a pole at $z = -(K^* - 1 + K d_c)/2$, which shows that

\[ \lambda_X \le \frac{K^* - 1 + K^* d_c}{2} + \frac{(K - K^*) d_c}{2}. \]

Compared to the result of Lemma [TJ we find that the bounds are tighter when
$\eta_1 > d_c$, which proves the lemma. (End of Proof)

Second, the following lemma shows the lower bound of $\lambda_X$.

Lemma 11 Under the same condition as in Theorem^ it holds that

\[ \lambda_X \ge \frac{K^* - 1 + K^* d_c}{2}. \]

Proof: We can immediately obtain the inequality based on the minimality condition
of $q(x)$ and $d > K^* - 1 + K^* d_c$. (End of Proof)

Last, using these lemmas, we prove Theorem 9. As shown in the proofs of
Lemmas 7 and 10, $\lambda_X$ is a linear function of $\eta_1$ due to the factor $a_k^{\eta_1 - 1}$ in the
Dirichlet prior. The upper and lower bounds imply that, for $\eta_1$ close to zero, there
exists a constant $\alpha$ such that

\[ \lambda_X = \alpha\eta_1 + \beta, \]

where $\beta = (K^* - 1 + K^* d_c)/2$. Eliminated components appear in $\alpha\eta_1$ since their
mixing ratio parameters converge to zero in the effective area, and the prior factor
$a_k^{\eta_1 - 1}$ works on the calculation of the pole of $\zeta_X(z)$. The phase in the upper
bounds eliminates all redundant components, and the constant term $\beta$ in the above
expression is the same value as that of the bounds. This means that the redundant
components are all eliminated in this phase. On the other hand, the upper bounds
also indicate that $\lambda_X$ must be a constant function for a sufficiently large $\eta_1$. When
there is no linear factor of $\eta_1$ in $\lambda_X$, all mixing ratio parameters converge to nonzero
values; all components are used in this phase. Therefore, we have found the two
phases, as desired.